LS4003 R tutorial 4

Chi-squared test in R

In this tutorial we’re going to use R to calculate chi-squared and Fisher’s test results for categorical data.

Make sure you’ve completed the Tutorial 1 section on using R from Excel before starting here.

Install and Set-up

A refresher for how to install and set up R and RStudio.

To get set up, follow the steps below.

  1. Type “AppsAnywhere” into the Windows search bar. This will open in a web browser.
  2. Type “RStudio” into the AppsAnywhere search box.
  3. Click “Launch” and wait for it to install and open.

GIF of Opening RStudio

  4. Copy and paste the following into the Console on the left, and press enter.

setwd("O:/")

  5. Click the “More” cog and select “Go To Working Directory”

You should now be in your OneDrive. You should be able to recognise the files and folders listed, from what you have saved here in your other classes.

GIF of Opening OneDrive

Occasionally there is an issue with how OneDrive is loaded on the University computer.

If you get the error message:

Error in setwd("O:/") : cannot change working directory

Then try the following. Replace the underscores with your K number.

setwd("C:/Users/K______/")

Click the “More” cog and select “Go To Working Directory”

Then find and click on the folder:

OneDrive Kingston University

Click the “More” cog and select “Set As Working Directory”

GIF of Opening OneDrive with common error

If you don’t already have a folder for LS4003 Statistics, then you can create one by clicking “New Folder” and entering a name.

If your new folder doesn’t appear, click the refresh button (to the right of the more cog).

Then:

  1. Click into your new folder

  2. Click the More Cog and select “Set As Working Directory”

GIF of Making a Folder

Once you’re in your folder you can create and save an R file. This is where you put your code.

  1. Click on the Green Plus icon and select “R Script”
  2. In the top bar, click “File” and then “Save”
  3. Give your file a name (e.g. “R_tutorial_1” )

When you make any changes, you can save the file by going File -> Save.

You can also save by holding down Control and S at the same time.

GIF of Making an R File

Note

This will automatically add the “.R” extension so we know it’s an R file - R_tutorial_1.R

Warning

Make sure you can find your file in File Explorer. Always back up your work, for example by saving it in OneDrive or emailing it to yourself, so that you don’t lose your progress.

You’re now ready to run some R code!

  1. Copy and paste the following into your R file:

value <- "Hello World"
value

  2. Highlight both lines and click the “Run” icon (green arrow)

You should see a result in your console (bottom left panel) and your environment (top right panel)

You’re now ready to work through the worksheet! As you go, try and figure out what each bit of code is doing. What happens if you change something?

GIF of Running an R File

Posit Cloud is an online, cloud-based option. It’s a bit more limited than running on a university computer or your own computer, but the free option should be enough for this module.

Go to Posit Cloud and create a free account

Log in, then go to New Project -> New RStudio Project.

Make a new folder in the bottom right panel (by clicking the New Folder button) called “LS4003_Statistics”.

Click on this folder to enter it, and then click the More cog (bottom right panel) and select “Set as Working Directory”.

To run R on your own machine, you have to install R (the programming language) and RStudio (the development environment).

When installing, click the most appropriate option for your machine (Windows/Mac/Linux)

Install R

Install RStudio

Once you have installed both, open RStudio.

Navigate to your Documents folder in bottom right panel. (If you can’t find it, type in setwd("~/Documents") to the console on the bottom left, then click the More cog on the bottom right and select “Go to Working Directory”)

Create a new folder called LS4003_Statistics by clicking the New Folder button on the right hand side.

Click on your folder (LS4003_Statistics) to enter it.

Set that as your final working directory by clicking on the “More” cog icon again and selecting “Set as Working Directory”.

Dataset - Portuguese secondary school students

For this tutorial we need the “student-mat.csv” file from the Canvas page. This is from a study looking at factors that may affect maths students at a school in Portugal, and it includes lots of categorical data.

This dataset contains the following information about each student:

Column Data
school student’s school (‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex student’s sex (binary: ‘F’ - female or ‘M’ - male)
age student’s age (numeric: from 15 to 22)
address student’s home address type (‘U’ - urban or ‘R’ - rural)
famsize family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus parent’s cohabitation status (‘T’ - living together or ‘A’ - apart)
Medu mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Fedu father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
Mjob mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup extra educational support (binary: yes or no)
famsup family educational support (binary: yes or no)
paid extra paid classes within the course subject (binary: yes or no)
activities extra-curricular activities (binary: yes or no)
nursery attended nursery school (binary: yes or no)
higher wants to take higher education (binary: yes or no)
internet Internet access at home (binary: yes or no)
romantic with a romantic relationship (binary: yes or no)
famrel quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime free time after school (numeric: from 1 - very low to 5 - very high)
goout going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health current health status (numeric: from 1 - very bad to 5 - very good)
absences number of school absences (numeric: from 0 to 93)
G1 first period grade (numeric: from 0 to 20)
G2 second period grade (numeric: from 0 to 20)
G3 final grade (numeric: from 0 to 20, output target)

Read the data into R and create contingency tables

First we need to read this information and store it in an R dataframe.
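A minimal sketch of the reading step, assuming the file is in your working directory and is comma-separated (some copies of this dataset use semicolons, in which case read.csv2() or sep = ";" would be needed). The data frame name `students` is an assumption, and the three rows below are made up to stand in for the real file so that the sketch runs on its own.

```r
# With the real file in your working directory you would run:
# students <- read.csv("student-mat.csv")

# Stand-in: three made-up rows using a few of the real column names
students <- read.csv(text = "school,sex,age,activities
GP,F,18,no
GP,M,17,yes
MS,F,16,yes")

# Check the structure and the first few rows
str(students)
head(students)
```

If str() shows a single column containing everything, the separator is probably wrong.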

For our chi-squared and Fisher tests, however, we need to generate a contingency table. Effectively, we need to choose two categorical variables and count the occurrences of each combination. Luckily, R will do this for us using the table() function.

Let’s first look to see if there’s an association between a student’s sex and whether they do after-school activities.
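As a rough sketch of how table() counts combinations, the example below uses a tiny made-up data frame in place of the real file. The names `students` and `activity_table` are assumptions; the column names `sex` and `activities` come from the dataset description above.

```r
# Made-up stand-in data; with the real file use read.csv("student-mat.csv")
students <- data.frame(
  sex        = c("F", "F", "F", "M", "M", "M", "M", "F"),
  activities = c("no", "yes", "no", "yes", "yes", "no", "yes", "no")
)

# Count every combination of sex (rows) and activities (columns)
activity_table <- table(students$sex, students$activities)
activity_table

# Add row and column totals as a sanity check on the counts
addmargins(activity_table)
```

Putting sex first makes it the row variable, which matters later when comparing one group against the other.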

Odds Ratio

We can calculate the odds ratio for our comparison using the oddsratio() function from the epitools package.

You’ll first need to install.packages('epitools').
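The sketch below shows one way this might look, using made-up counts rather than the real contingency table. It works the odds ratio out by hand first, then calls oddsratio() only if epitools is installed. By default epitools reports a median-unbiased estimate, so its figure can differ slightly from the hand calculation. The name `activity_table` is an assumption.

```r
# Made-up counts standing in for table(students$sex, students$activities)
activity_table <- matrix(c(60, 40,    # column "no":  F, M
                           50, 55),   # column "yes": F, M
                         nrow = 2,
                         dimnames = list(sex = c("F", "M"),
                                         activities = c("no", "yes")))

# Odds of "yes" within each sex, then the ratio (M compared to F)
odds_f <- activity_table["F", "yes"] / activity_table["F", "no"]
odds_m <- activity_table["M", "yes"] / activity_table["M", "no"]
odds_ratio <- odds_m / odds_f
odds_ratio

# The same comparison with epitools, if the package is installed
if (requireNamespace("epitools", quietly = TRUE)) {
  print(epitools::oddsratio(activity_table)$measure)
}
```

The $measure element holds the estimate with its lower and upper confidence limits.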

This result shows that the odds of male students doing an activity are 1.491 times the odds for female students, so their odds are around 1.5 times higher.

The “lower” and “upper” values in this table are the boundaries of the 95% confidence interval. As both values are above one, this indicates increased odds (of males doing an activity compared to females).

We can do the same calculation the other way around, looking at the odds of females doing an activity compared to males.

The options you can give for the rev (reverse) parameter are “neither”, “rows”, “columns” and “both”.
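As a hedged illustration, reversing the comparison by hand is simply taking the reciprocal of the odds ratio. With sex in the rows of the table, rev = "rows" should give the matching epitools result. The counts below are made up and the object names are assumptions.

```r
# Made-up counts with sex in the rows and activities in the columns
activity_table <- matrix(c(60, 40, 50, 55), nrow = 2,
                         dimnames = list(sex = c("F", "M"),
                                         activities = c("no", "yes")))

# Odds ratio for M compared to F, then flipped to F compared to M
or_m_vs_f <- (activity_table["M", "yes"] / activity_table["M", "no"]) /
             (activity_table["F", "yes"] / activity_table["F", "no"])
or_f_vs_m <- 1 / or_m_vs_f
or_f_vs_m

# Swap the row order so that M becomes the reference group
if (requireNamespace("epitools", quietly = TRUE)) {
  print(epitools::oddsratio(activity_table, rev = "rows")$measure)
}
```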

Risk ratio

Risk ratio is very similar to odds ratio (refer back to Statistics Lecture 4 for an explanation of the differences).

We can also calculate risk ratio in R.
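A minimal sketch, again with made-up counts: the risk is the proportion of each sex doing an activity, and the risk ratio divides one proportion by the other. The epitools riskratio() call is assumed to be the function used here, and is only run if the package is installed.

```r
# Made-up counts with sex in the rows and activities in the columns
activity_table <- matrix(c(60, 40, 50, 55), nrow = 2,
                         dimnames = list(sex = c("F", "M"),
                                         activities = c("no", "yes")))

# Risk = proportion of each sex answering "yes"
risk_f <- activity_table["F", "yes"] / sum(activity_table["F", ])
risk_m <- activity_table["M", "yes"] / sum(activity_table["M", ])

# Risk ratio for M compared to F
risk_ratio <- risk_m / risk_f
risk_ratio

# The same comparison with epitools, if the package is installed
if (requireNamespace("epitools", quietly = TRUE)) {
  print(epitools::riskratio(activity_table)$measure)
}
```

Note the difference from the odds ratio: risk divides by the row total, while odds divide "yes" by "no".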

This shows that the risk ratio for males doing an activity (compared to females) is 1.217.

Chi-squared test

As our grand total of observed values is more than 50 and every cell count exceeds 5, we can use the chi-squared test.

The p-value is 0.0597, which is not less than 0.05, so the result is not statistically significant at the 5% level.
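A sketch of how the test could be run, using made-up counts in place of the real table. The object names are assumptions. For a 2x2 table, chisq.test() applies Yates' continuity correction by default (correct = TRUE), which is worth knowing when comparing against a hand calculation.

```r
# Made-up counts standing in for table(students$sex, students$activities)
activity_table <- matrix(c(60, 40, 50, 55), nrow = 2,
                         dimnames = list(sex = c("F", "M"),
                                         activities = c("no", "yes")))

# Run the test and store the result so we can inspect its parts
chisq_activities <- chisq.test(activity_table)

# Expected counts under the null hypothesis of no association
chisq_activities$expected

# Full summary, then just the p-value
chisq_activities
chisq_activities$p.value
```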

Fisher test

We can also do a Fisher test if we’re looking at categories with smaller counts.

If we compare sex and whether or not the student wants to go to higher education, you’ll see we have very low numbers of students that don’t want to go to higher education in this class.

We can then use the Fisher test:
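The steps above might look like the sketch below. The counts are made up to mimic a table with one sparse column, and the name `higher_table` is an assumption. With the real data the table would come from table() on the `sex` and `higher` columns.

```r
# Made-up counts: very few students answering "no" to higher education
higher_table <- matrix(c(4, 16,       # column "no":  F, M
                         204, 171),   # column "yes": F, M
                       nrow = 2,
                       dimnames = list(sex = c("F", "M"),
                                       higher = c("no", "yes")))
higher_table

# Fisher's exact test does not rely on large expected counts
fisher_higher <- fisher.test(higher_table)
fisher_higher
```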

Chi-squared with multiple categories

Finally, let’s look at an example with multiple categories in one of the variables.

Perhaps there is a difference between the reasons students chose the school and which school they attend.
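As a sketch, the contingency table for this comparison is built the same way as before, just with a variable that has four categories. The rows below are made up and the names `students` and `reason_table` are assumptions.

```r
# Made-up stand-in data; with the real file use read.csv("student-mat.csv")
students <- data.frame(
  school = rep(c("GP", "MS"), times = c(6, 4)),
  reason = c("course", "home", "reputation", "reputation", "other", "home",
             "course", "other", "other", "home")
)

# Reasons in the rows, schools in the columns
reason_table <- table(students$reason, students$school)
reason_table
```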

We can then run the chi-squared test exactly the same as before:
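A hedged sketch with made-up counts for a 4x2 table. The result is stored as `chisq`, matching the name used in the corrplot code further down. The table name `reason_table` is an assumption.

```r
# Made-up counts standing in for table(students$reason, students$school)
reason_table <- matrix(c(120, 95, 30, 105,   # column "GP"
                         25, 15, 12, 8),     # column "MS"
                       nrow = 4,
                       dimnames = list(reason = c("course", "home",
                                                  "other", "reputation"),
                                       school = c("GP", "MS")))

# Same function as before; no continuity correction is applied beyond 2x2
chisq <- chisq.test(reason_table)
chisq

# Degrees of freedom: (rows - 1) * (columns - 1)
chisq$parameter
```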

The p-value is 0.00595 (is this less than 0.05?)

We can visualise which factors had the biggest contribution to the result by doing the following calculation:

\(100 \times \frac{residuals^2}{statistic}\)

We can then visualise this with corrplot.

library(corrplot)

# Run calculation of contributions (chisq is the stored chi-squared test result)
contrib <- 100 * chisq$residuals^2 / chisq$statistic

# Visualise with corrplot
corrplot(contrib, is.cor = FALSE)
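To see why this calculation gives percentages, the sketch below checks that the contributions add up to 100. It uses made-up counts and base R only, so corrplot is not needed. This works because, for tables larger than 2x2, the chi-squared statistic is the sum of the squared residuals.

```r
# Made-up counts for a 4x2 table of reason by school
reason_table <- matrix(c(120, 95, 30, 105, 25, 15, 12, 8), nrow = 4,
                       dimnames = list(reason = c("course", "home",
                                                  "other", "reputation"),
                                       school = c("GP", "MS")))
chisq <- chisq.test(reason_table)

# Percentage contribution of each cell to the overall statistic
contrib <- 100 * chisq$residuals^2 / chisq$statistic
round(contrib, 1)

# The contributions should total 100
total_contrib <- sum(contrib)
total_contrib
```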

Testing a subset of the values

The strongest contributions appear to be in the “reputation” and “other” categories.

To see if these are statistically significant, we can create a new data frame containing only these values so that we can compare them.

Once we have our contingency table, we can then run our chi-squared test:
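The subsetting and testing steps above could be sketched as follows. The data are made up and the names `students`, `subset_students` and `subset_table` are assumptions. If `reason` were stored as a factor rather than as text, droplevels() would be needed to remove the unused categories from the table.

```r
# Made-up stand-in data: 60 GP students and 40 MS students
students <- data.frame(
  school = rep(c("GP", "MS"), times = c(60, 40)),
  reason = c(rep(c("course", "home", "other", "reputation"),
                 times = c(20, 15, 5, 20)),
             rep(c("course", "home", "other", "reputation"),
                 times = c(10, 10, 12, 8)))
)

# Keep only the rows where the reason is "reputation" or "other"
subset_students <- students[students$reason %in% c("reputation", "other"), ]

# Contingency table for the subset, then the chi-squared test
subset_table <- table(subset_students$reason, subset_students$school)
subset_table
chisq.test(subset_table)
```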

Extension

This was a very large dataset and we only looked at a few columns.

What else can you find out?

Not all of the columns are categorical data. If you want to compare any other two variables (e.g. whether they went to nursery and their current grades), which test would you use?

We’ve covered all the main statistical tests now, so try to apply what you’ve learnt from every tutorial to this dataset.

Meme of the Hulk being made for a Chi Squared test
